feat(ingest): Implement first version of project config handler #72
base: main
Conversation
Implemented the basic version of the relay project config handler, which splits the project keys across multiple cells. Right now it does a round-robin split; this will change to a lookup into the locator later.
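For illustration, a minimal sketch of what a round-robin split across cells could look like (the type and function names here are assumptions for the sketch, not the actual code in this PR):

```rust
use std::collections::HashMap;

/// Hypothetical sketch of the round-robin split described above: project
/// public keys are dealt out across the configured cells in turn.
fn split_round_robin(public_keys: &[String], cells: &[String]) -> HashMap<String, Vec<String>> {
    let mut assignments: HashMap<String, Vec<String>> = HashMap::new();
    if cells.is_empty() {
        return assignments;
    }
    for (i, key) in public_keys.iter().enumerate() {
        // Each key goes to the next cell in order, wrapping around.
        let cell = &cells[i % cells.len()];
        assignments.entry(cell.clone()).or_default().push(key.clone());
    }
    assignments
}
```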
Response::builder()
    .status(StatusCode::OK)
    .header(CONTENT_TYPE, "application/json")
Do we need to copy any headers from the responses we receive from relay?
Added code to handle the response headers. They are now copied from the first upstream that returns a successful response.
#[serde(flatten)]
pub extra: HashMap<String, JsonValue>,
I see no `extra` anywhere in this file:
https://github.com/getsentry/sentry/blob/master/src/sentry/api/endpoints/relay/project_configs.py
Am I looking in the wrong place?
The purpose of `extra` is to be forward compatible with the protocol. Any field that the ingest-router does not care about (i.e. fields that should be opaque to it) is stashed into `extra`. It's defined with `serde(flatten)`, so it is handled appropriately during both serialization and deserialization.
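For illustration, a minimal, self-contained sketch of that pattern (struct and field names here are made up, not taken from this PR): unknown fields survive a deserialize/serialize round trip untouched.

```rust
use std::collections::HashMap;
use serde::{Deserialize, Serialize};
use serde_json::Value as JsonValue;

#[derive(Serialize, Deserialize)]
struct ProjectConfigRequest {
    #[serde(rename = "publicKeys")]
    public_keys: Vec<String>,
    /// Everything the router does not model explicitly is kept verbatim here.
    #[serde(flatten)]
    extra: HashMap<String, JsonValue>,
}

fn main() -> serde_json::Result<()> {
    // "newUpstreamField" is unknown to the router, but it lands in `extra`
    // and is written back out unchanged when the struct is re-serialized.
    let input = r#"{"publicKeys": ["abc123"], "newUpstreamField": true}"#;
    let parsed: ProjectConfigRequest = serde_json::from_str(input)?;
    assert_eq!(parsed.extra["newUpstreamField"], true);
    println!("{}", serde_json::to_string(&parsed)?);
    Ok(())
}
```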
///
/// See module-level docs for complete protocol details, implementation strategy,
/// and request/response flow diagrams.
pub struct RelayProjectConfigsHandler {
Since there will be a number of handlers that follow this pattern, could this be a trait to enforce that they all follow similar patterns?
E.g. something like:
trait Handler<Req: Serialize + DeserializeOwned, Res: Serialize + DeserializeOwned> {
    fn split_requests(request: Req) -> Vec<Req>;
    fn merge_results(results: Vec<Res>) -> Res;
}
where Req and Res can be anything that implements Serialize and Deserialize. I think this forces the implementation of each endpoint to conform to some basic patterns and look the same, which is nice for readability. It could also enable more of the common parts (spawning multiple requests and waiting for them, timeouts, error handling/backpressure) to be done in a central place outside of the specific handler.
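To make that concrete, here is a rough sketch (not code from this PR) of how the shared fan-out/fan-in plumbing could live in one place once such a trait exists; `dispatch` stands in for the real per-cell request step, and timeouts/error handling are elided:

```rust
use serde::{de::DeserializeOwned, Serialize};

/// Hypothetical generic driver built on the trait sketched above. Each
/// endpoint only supplies split_requests/merge_results; everything else
/// (dispatching, timeouts, retries) could be centralized here.
fn handle_generic<Req, Res, H>(request: Req, dispatch: impl Fn(Req) -> Res) -> Res
where
    Req: Serialize + DeserializeOwned,
    Res: Serialize + DeserializeOwned,
    H: Handler<Req, Res>,
{
    // Endpoint-specific: how to split the incoming request per cell.
    let sub_requests = H::split_requests(request);
    // Shared: fan out to cells (here just a plain closure for brevity).
    let results: Vec<Res> = sub_requests.into_iter().map(dispatch).collect();
    // Endpoint-specific: how to merge the per-cell results.
    H::merge_results(results)
}
```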
I do not know of more use cases where this may be needed right now. If we see that use case pop up, then I can make this a generic trait and have multiple implementations. But I would rather do it when I see the need for it.
The other parts of the handler code look like they are already built with the assumption that there would be many handlers, though.
e.g. https://github.com/getsentry/synapse/blob/816dfb3b85e06b1ad1e2a2ffaa5479e615619950/ingest-router/src/config.rs#L50C10-L53 -- wouldn't there be many members on this enum, each of which would map 1:1 with a separate handler?
As for other use cases -- there is this list of endpoints in getsentry, which I would assume would each be separate handler implementations here
https://github.com/getsentry/sentry/blob/c3e99bba9310b1b8385ac04dae75baaaadfe61fe/src/sentry/api/urls.py#L1087-L1128
Motivation for this: you can see in the diff of this PR the complexity of what you have implemented in the relay_project_config_handler module, as well as all the conversations here around timeout handling and the like. It seems easy to get wrong, and I don't think we'd want people to implement that again for every endpoint that needs to be added to synapse now and in the future.
    mut handle,
    public_keys,
} = task_with_keys;
let task_result = timeout(Duration::from_secs(30), &mut handle).await;
I sort of wonder if we need more than one timeout value for various scenarios. The specific cases I'm thinking about are:
- I think a really long value like 30 seconds is fine if we have no responses from any cell -- without any data, Relay is making no progress anyway, so it's not any worse to wait longer for a response.
- If one cell returns in, say, 200 milliseconds, and the other cell is down, waiting the full 30 seconds for the other cell may be a serious problem. Since the requests are chunked, this becomes a very large number of 30-second waits back to back. If we returned sooner, the cell that was functional would be able to make progress much sooner. I'm worried that such a long wait for an unresponsive cell risks effectively taking down all other cells as well.
I have added the capability to do an adaptive timeout. By default, the slower tasks get 5 more seconds than the fastest task to complete their work. Though I would imagine that when we roll this out in production we will need a higher value, since we want to avoid the scenario of an empty cell always returning first and never letting the existing cells respond.
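As a rough illustration of that adaptive behaviour (the 5-second grace window and the names below are assumptions for the sketch, not the exact code in this PR): once the fastest cell has responded, each remaining cell only gets a small additional budget.

```rust
use std::time::{Duration, Instant};

/// How much longer to wait for the remaining cells, given when (if ever)
/// the fastest cell finished. Purely illustrative.
fn remaining_budget(first_completion: Option<Instant>, grace: Duration, full_timeout: Duration) -> Duration {
    match first_completion {
        // Nothing has responded yet: keep waiting up to the full, long timeout.
        None => full_timeout,
        // Something responded: the rest only get `grace` beyond that point.
        // saturating_duration_since yields Duration::ZERO once the deadline has passed.
        Some(done_at) => (done_at + grace).saturating_duration_since(Instant::now()),
    }
}
```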
> I sort of wonder if we need more than 1 timeout value for various scenarios.

Having multiple configuration points for different timeouts seems wise. Waiting here would be sequential, which isn't ideal. Perhaps we could combine a JoinSet, to wait on the task futures concurrently, with a timeout, so that we get as many responses as possible within the 30s budget?
I have a naive implementation right now. Working on making it robust.
/// Failed keys are added to the pending array for retry.
///
/// Global config is selected from the highest priority cell based on cells.cell_list order.
async fn collect_and_merge(&self, tasks: Vec<TaskWithKeys>, cells: &Cells) -> MergedResults {
If I'm reading this correctly, doesn't this mean that we are waiting for each task and applying the modified timeout in the order that the cells are defined?
- so if cell 1 returns quickly, and cell 2 is down -> we correctly apply the shorter timeout to the second request and return to relay faster
- if cell 2 returns quickly and cell 1 does not -> we still wait for cell 1 for the 30 seconds (the longer timeout), even though there was a cell 2 response much sooner.
let now = Instant::now();
if now >= deadline {
    // Deadline already passed, use minimal timeout
    Duration::from_millis(1)
Is there something special about 1 millisecond here? Curious why this isn't Duration::ZERO.
})?;

// Split publicKeys across upstreams
let split_requests = self.split_keys_by_upstream(&request_data, cells);
Since it's not needed afterward, you could pass the owned request_data into this function and save yourself a clone.
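A tiny illustration of the suggestion, with a stand-in `RequestData` type (not the real one from this PR):

```rust
/// Stand-in for the real request type.
struct RequestData {
    public_keys: Vec<String>,
}

// Borrowing version: if the callee needs owned data, it has to clone.
fn split_borrowed(request_data: &RequestData) -> Vec<String> {
    request_data.public_keys.clone()
}

// Owning version: the caller hands request_data over, so no clone is needed.
fn split_owned(request_data: RequestData) -> Vec<String> {
    request_data.public_keys
}
```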
for (name, value) in base_request.headers() {
    req_builder = req_builder.header(name, value);
}
Were you planning to include any of the header transformations that the proxy does here? E.g. adding the Via header and stripping hop-by-hop headers.
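For reference, a minimal sketch of the kind of transformation meant here, assuming the http crate's HeaderMap; the helper name and the Via value are made up for the example:

```rust
use http::header::{HeaderMap, HeaderValue, VIA};

/// Drop hop-by-hop headers and record this hop before forwarding upstream.
/// Purely illustrative; not code from this PR.
fn prepare_forward_headers(mut headers: HeaderMap) -> HeaderMap {
    // Hop-by-hop headers apply to a single connection and must not be forwarded.
    for name in [
        "connection",
        "keep-alive",
        "proxy-authenticate",
        "proxy-authorization",
        "te",
        "trailer",
        "transfer-encoding",
        "upgrade",
    ] {
        headers.remove(name);
    }
    // Record this hop so downstream can detect forwarding loops.
    headers.append(VIA, HeaderValue::from_static("1.1 ingest-router"));
    headers
}
```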
let mut headers_by_cell: HashMap<String, HeaderMap> = HashMap::new();
let mut additional_deadline: Option<Instant> = None;

for task_with_keys in tasks {
How about using a tokio::JoinSet here? It allows you to easily get the first result with the global timeout regardless of the cell order, e.g.
let first = timeout(global_timeout, task_set.join_next()).await;
Then you can loop through and keep calling join_next with the smaller timeout:
let next = timeout(reduced_timeout, task_set.join_next()).await;
This way the tasks return in the order in which they complete.
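Along those lines, a rough sketch of that shape (simplified; the two timeout values and the generic result type are placeholders, not the actual handler types from this PR):

```rust
use std::time::Duration;
use tokio::{task::JoinSet, time::timeout};

/// Wait for the first completed cell task under the global timeout, then
/// drain the rest under a shorter one, in completion order. Illustrative only.
async fn collect_with_adaptive_timeout<T: 'static>(
    mut tasks: JoinSet<T>,
    global_timeout: Duration,
    reduced_timeout: Duration,
) -> Vec<T> {
    let mut responses = Vec::new();

    // First completion, whichever cell it comes from.
    if let Ok(Some(Ok(first))) = timeout(global_timeout, tasks.join_next()).await {
        responses.push(first);
    }

    // Remaining cells each get only the reduced timeout; a cell that timed out
    // or panicked is simply skipped here (real code would record it as pending).
    while let Ok(Some(joined)) = timeout(reduced_timeout, tasks.join_next()).await {
        if let Ok(response) = joined {
            responses.push(response);
        }
    }

    responses
}
```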